Abstract. Suppose we are asked to preprocess a string $s[1..n]$ such that
later, given a substring's endpoints, we can quickly count how many
distinct characters it contains. In this paper we give a data structure for
this problem that takes $nH_0(s) + O(n) + o(nH_0(s))$ bits, where $H_0(s)$ is
the 0th-order empirical entropy of $s$, and answers queries in
$O(\log^{1+\epsilon} n)$ time for any constant $\epsilon > 0$. We also show how
our data structure can be made partially dynamic.

1 Introduction 

Coloured range counting is a well-studied problem with applications in, e.g., 
computational geometry, database research and bioinformatics. For this general 
problem, we are asked to store a set of $n$ coloured points in $\mathbb{R}^d$ such that later,
given an axis-aligned box, we can quickly count the number of distinct colours 
it contains. Most papers on this problem have focused on $d \ge 2$ dimensions (see,
e.g., [5]); the upper bound for general static one-dimensional coloured range 
counting has not changed since 1995, when Bozanis, Kitsios, Makris and Tsakalidis
[1] gave an $O(n)$-word data structure that answers queries in $O(\log n)$ time.
Recently, however, Gagie, Navarro and Puglisi [3] considered the special case in 
which the coloured points are the integers $1, \ldots, n$. Storing these points is
equivalent to storing a string $s[1..n]$ over an alphabet whose size $\sigma$ is the number of
distinct colours, such that later, given a substring's endpoints, we can quickly 
count how many distinct characters it contains. 

Gagie et al. gave a data structure that takes $n \log \sigma + O(n \log \log n)$ bits and
answers queries in $O(\log n)$ time. (In this paper $\log$ means $\log_2$.) Their solution
is built on work by Muthukrishnan [8] about coloured range queries in strings. 
Muthukrishnan defined $C[1..n]$ to be the array in which each cell $C[q]$ stores the
largest value $p < q$ such that $s[p] = s[q]$ (or 0 if no such $p$ exists). He observed
that $s[q]$ is the first occurrence of that distinct character in $s[i..j]$ if and only if
$i \le q \le j$ and $C[q] < i$. Therefore, the number of distinct characters in $s[i..j]$ is
the number of values in $C[i..j]$ strictly less than $i$. Gagie et al. noted that, if we
store $C$ in a wavelet tree [4], which takes $n \log n + o(n \log n)$ bits, then we can
count all such values in $O(\log n)$ time; for details, see [6]. This is already a slight
improvement over the bounds we achieve with Bozanis et al.'s data structure [1],
but Gagie et al. showed the space can be reduced to $n \log \sigma + O(n \log \log n)$ bits
by modifying the wavelet tree.
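Muthukrishnan's observation can be checked directly with a few lines of Python. The sketch below is only an illustration: the function names are ours, positions are 1-indexed as in the text, and a linear scan stands in for the wavelet-tree query, so it shows the counting rule rather than the stated time or space bounds.

```python
def previous_occurrence_array(s):
    """Return C as a 1-indexed list (C[0] is unused padding).

    C[q] is the largest p < q with s[p] == s[q], or 0 if no such p exists.
    """
    last = {}
    C = [0] * (len(s) + 1)
    for q, ch in enumerate(s, start=1):
        C[q] = last.get(ch, 0)
        last[ch] = q
    return C


def count_distinct(C, i, j):
    """Count distinct characters in s[i..j] (1-indexed, inclusive).

    A position q in [i, j] is a first occurrence within the range
    exactly when C[q] < i, so we count those positions.
    """
    return sum(1 for q in range(i, j + 1) if C[q] < i)
```

For example, for s = "abracadabra" the array is C[1..11] = 0, 0, 0, 1, 0, 4, 0, 6, 2, 3, 8, and the range s[4..8] = "acada" has three entries smaller than 4 (at positions 4, 5 and 7), matching its three distinct characters.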

In Section 2 we describe a simple data structure that achieves essentially the 
same bound as Gagie et al.'s. In Section 3 we extend the ideas from Section 2 
to build a data structure that takes $nH_0(s) + O(n) + o(nH_0(s))$ bits, where
$H_0(s) \le \log \sigma$ is the 0th-order empirical entropy of $s$, and answers queries in
$O(\log^{1+\epsilon} n)$ time for any constant $\epsilon > 0$. This may be useful for applications
such as tracking the unique visitors to a website, allowing us to count the unique 
visitors in any given interval. In Section 4 we show how our data structure can 
be made partially dynamic. 

2 Simple Blocking 

In this section we give a simple proof that, using two normal wavelet trees and a
straightforward encoding of $C$, we need store only $(1 + o(1))(n \log \sigma + n \log \log n)$
bits to answer queries in $O(\log n)$ time. Without loss of generality, assume
$\sigma = o(n / \log n)$; otherwise, we achieve our desired bound by simply storing $C$ in a
single, normal wavelet tree. Our idea is to break $s$ into blocks of length
$b = \sigma \log n$ and encode the entry $C[q]$ differently depending on whether the
previous occurrence $s[p]$ of the character $s[q]$ is contained in the same block.
If $s[p]$ is contained in the same block as $s[q]$, then we write $C[q]$ as the
$\lceil \log b \rceil$-bit offset of $p$ within the block; otherwise, we write it as the
$\lceil \log n \rceil$-bit binary representation of $p$. Notice that, for each block,
there are at most $\sigma$ entries of $C$ encoded as $\lceil \log n \rceil$-bit
numbers.

We build a bitvector indicating how each entry of $C$ is encoded, which takes
$n + o(n)$ bits. We build one wavelet tree storing all the $\lceil \log b \rceil$-bit encodings,
which takes at most $n \log b + o(n \log b) = (1 + o(1))(n \log \sigma + n \log \log n)$ bits, and
another storing all the $\lceil \log n \rceil$-bit encodings, which takes at most
$\sigma \lceil n/b \rceil \log n + o(\sigma \lceil n/b \rceil \log n) = n + o(n)$ bits.
Notice that, if $s[q]$ is the first occurrence of that
distinct character in $s[i..j]$ and $C[q]$ is encoded in $\lceil \log b \rceil$ bits, then $s[q]$ must
be between $s[i]$ and the end of the block containing $s[i]$. We can count all such
characters in $O(\log b) = O(\log \sigma + \log \log n)$ time using the bitvector and the
first wavelet tree. We can count all the other first occurrences in $O(\log n)$ time
using the bitvector and the second wavelet tree.
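The two-level encoding and the resulting query can be sketched as follows. This is a minimal illustration under stated assumptions: the block length b is passed in as a parameter rather than fixed to the value used in the text, the function names are ours, and plain Python lists with linear scans stand in for the bitvector and the two wavelet trees.

```python
def build_C(s):
    """C[q] = previous occurrence of s[q], or 0 (1-indexed, C[0] unused)."""
    last, C = {}, [0] * (len(s) + 1)
    for q, ch in enumerate(s, start=1):
        C[q] = last.get(ch, 0)
        last[ch] = q
    return C


def encode(s, b):
    """Encode each C[q] as (is_short, value).

    If the previous occurrence lies in q's block, store its offset within
    the block (short encoding); otherwise store C[q] itself (long encoding).
    """
    C = build_C(s)
    enc = [None]  # padding so that enc[q] matches position q
    for q in range(1, len(s) + 1):
        start = ((q - 1) // b) * b + 1  # first position of q's block
        if C[q] >= start:
            enc.append((True, C[q] - start))
        else:
            enc.append((False, C[q]))
    return enc


def count_distinct(enc, b, i, j):
    """Count distinct characters in s[i..j] from the encoded entries."""
    start = ((i - 1) // b) * b + 1
    end = min(start + b - 1, j)
    # Short entries can be first occurrences only inside i's own block,
    # and only if the stored offset points to a position before i.
    short = sum(1 for q in range(i, end + 1)
                if enc[q][0] and enc[q][1] < i - start)
    # Long entries store C[q] directly, so compare against i.
    long_ = sum(1 for q in range(i, j + 1)
                if not enc[q][0] and enc[q][1] < i)
    return short + long_
```

Short entries in later blocks never need to be inspected, because their previous occurrence lies at or after the start of a block that begins after position i.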

Theorem 1. Given a string $s[1..n]$, we can build a data structure that takes
$(1 + o(1))(n \log \sigma + n \log \log n)$ bits such that later, given a substring's endpoints,
in $O(\log n)$ time we can count how many distinct characters it contains.

Notice that, if $\sigma \ge \log n$, then Gagie et al.'s data structure is within a
constant factor of being succinct and the data structure we just presented is within
a factor of 2 of being succinct. If $\sigma < \log n$, then we can store $s$ in a multiary
wavelet tree [2], which takes $nH_0(s) + o(n)$ bits, and answer any query by
enumerating the characters in the alphabet and, for each one, using two $O(1)$-time
rank queries to see whether it occurs in the given substring.
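The small-alphabet strategy can be illustrated with a short Python sketch. The prefix-count tables below are our own stand-in for constant-time rank queries; they use far more space than the multiary wavelet tree cited in the text and are meant only to show the two-rank test per character.

```python
def build_rank_tables(s):
    """prefix[c][k] = number of occurrences of c in s[1..k] (1-indexed)."""
    n = len(s)
    prefix = {c: [0] * (n + 1) for c in set(s)}
    for k, ch in enumerate(s, start=1):
        for c, row in prefix.items():
            row[k] = row[k - 1] + (1 if c == ch else 0)
    return prefix


def count_distinct_small_alphabet(prefix, i, j):
    """Count characters c with rank_c(j) - rank_c(i - 1) > 0."""
    return sum(1 for row in prefix.values() if row[j] - row[i - 1] > 0)
```

Each query performs two table lookups per alphabet character, so its cost is proportional to the alphabet size.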



Corollary 1. Given a string $s[1..n]$, we can build a data structure that takes
$2n \log \sigma + o(n \log \sigma)$ bits such that later, given a substring's endpoints, in $O(\log n)$
time we can count how many distinct characters it contains.

3 Multi-Size Blocking 

In this section we extend our idea from the previous section so that, instead of
encoding entries of $C$ differently for only two block sizes (i.e., $\sigma \log n$ and $n$),
we use many block sizes. In particular, we use $O(\log \log n / \log(1 + \delta))$ different
block sizes,

$$2,\; 2^{1+\delta},\; 2^{\max((1+\delta)^2,\, 2)},\; 2^{\max((1+\delta)^3,\, 3)},\; 2^{\max((1+\delta)^4,\, 4)},\; \ldots,\; n,$$

where $\delta > 0$ is a value we will specify later. Also, for each block size $b$, we
consider $s$ to consist of about $2n/b$ evenly overlapping blocks,

$$s[1..b],\; s[b/2 + 1..3b/2],\; s[b + 1..2b],\; s[3b/2 + 1..5b/2],\; \ldots,\; s[n - b + 1..n].$$
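To make the block layout concrete, here is a small Python sketch. It rests on several assumptions of ours: block sizes are taken to be $2^{\max((1+\delta)^k,\, k)}$ for $k = 0, 1, 2, \ldots$, rounded up to integers and capped at $n$; blocks of an even size $b$ start every $b/2$ positions; and the final block is clipped at $n$ instead of being shifted to end there. The function names are hypothetical.

```python
import math


def block_sizes(n, delta):
    """Increasing block sizes ceil(2^max((1+delta)^k, k)), ending with n."""
    sizes, k, log_n = [], 0, math.log2(n)
    while True:
        e = max((1 + delta) ** k, k)
        if e >= log_n:  # the next size would reach n, so finish with n
            if not sizes or sizes[-1] < n:
                sizes.append(n)
            return sizes
        b = math.ceil(2 ** e)
        if not sizes or b > sizes[-1]:  # skip duplicates caused by rounding
            sizes.append(b)
        k += 1


def overlapping_blocks(n, b):
    """Blocks (start, end) of even length b, starting every b/2 positions."""
    half = b // 2
    return [(start, min(start + b - 1, n)) for start in range(1, n + 1, half)]
```

With this layout every position lies in at most two blocks of a given size, and any two positions at distance at most $b/2$ share a block, which is the property the encoding relies on.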

If $C[q] = p$ and the smallest block containing both $s[p]$ and $s[q]$ has size $b$, then
we write $C[q]$ as the $\lceil \log b \rceil$-bit offset of $p$ within the leftmost of the (at most) two
blocks of size $b$ containing $s[q]$. Notice $\log b \le (1 + \delta) \log(q - p) + 1$; calculation
shows that the total size of all the offsets is at most $(1 + \delta) nH_0(s) + O(n)$ bits.
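One way to carry out the calculation is sketched below; it is our reconstruction, not necessarily the authors' argument. We assume the following notation: $n_a$ is the number of occurrences of character $a$, $p_1 < \cdots < p_{n_a}$ are its positions, $p_0 = 0$, and $b_q$ is the block size used for position $q$. The first line uses $\log p_1 \ge 0$ and the concavity of $\log$ (Jensen's inequality); the second uses the bound on $\log b$ stated above and $nH_0(s) = \sum_a n_a \log(n/n_a)$.

```latex
\begin{align*}
\sum_{k=2}^{n_a} \log (p_k - p_{k-1})
  &\le \sum_{k=1}^{n_a} \log (p_k - p_{k-1})
   \le n_a \log \frac{\sum_{k=1}^{n_a} (p_k - p_{k-1})}{n_a}
   \le n_a \log \frac{n}{n_a}, \\
\sum_{q \,:\, C[q] > 0} \lceil \log b_q \rceil
  &\le \sum_{q \,:\, C[q] > 0} \bigl( (1+\delta) \log (q - C[q]) + 2 \bigr)
   \le (1+\delta) \sum_{a} n_a \log \frac{n}{n_a} + 2n
   = (1+\delta)\, n H_0(s) + O(n).
\end{align*}
```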

Let $t$ be a string indicating whether each entry $C[q]$ is 0 and, if not, the
block size used for it. We build a multiary wavelet tree [2] storing $t$. Since we
can always encode a block size $b$ using $O(\log \log b)$ bits (even if $\delta$ is very small,
thanks to the max in the definition of the block sizes), more calculation shows
that $H_0(t) = O(\log(H_0(s) + 1))$. It follows that, if $H_0(s)$ grows without bound
as $n$ goes to infinity, then the size of the tree is $o(nH_0(s))$ bits; otherwise, it is
$O(n)$ bits. Using the tree, in $O(1)$ time we can count all the characters whose
first appearance in $s$ is in $s[i..j]$.

For each block size $b$, we build a wavelet tree storing all the $\lceil \log b \rceil$-bit
encodings. By the same calculation as for the offsets, these wavelet trees take a total
of $(1 + \delta) nH_0(s) + O(n) + o(nH_0(s))$ bits. Notice that, for any block size $b$, if $s[q]$
is the first occurrence of that distinct character in $s[i..j]$ and $C[q]$ is encoded in
$\lceil \log b \rceil$ bits, then $s[q]$ must be between $s[i]$ and the end of the rightmost of the
(at most) two blocks of size $b$ containing $s[i]$. Using the multiary wavelet tree
and the wavelet tree for block size $b$, in $O(\log b)$ time we can count all such
characters in the right halves of both the leftmost and the rightmost blocks of size
$b$ containing $s[i]$. Since the right half of the leftmost block is the left half of the
rightmost block, the sum is the total number of such characters. It follows that
we can count all the distinct characters in $s[i..j]$ in $O(\log n \log \log n / \log(1 + \delta))$
time. Choosing $\delta = 1 / \log \log n$, for example, yields the following theorem:
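The effect of this choice of $\delta$ can be verified with a short calculation, given here as a sketch. It assumes $n \ge 4$, so that $0 < \delta \le 1$, and uses the fact that $\log_2(1 + \delta) \ge \delta$ on that interval (by concavity, since the two sides agree at $\delta = 0$ and $\delta = 1$).

```latex
\begin{align*}
\frac{\log \log n}{\log (1 + \delta)}
  &\le \frac{\log \log n}{\delta} = (\log \log n)^2
  &&\Longrightarrow\quad
  O\!\left( \frac{\log n \, \log \log n}{\log (1 + \delta)} \right)
  = O\bigl( \log n \, (\log \log n)^2 \bigr), \\
(1 + \delta)\, n H_0(s)
  &= n H_0(s) + \frac{n H_0(s)}{\log \log n}
   = n H_0(s) + o(n H_0(s)).
\end{align*}
```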

Theorem 2. Given a string $s[1..n]$, we can build a data structure that takes
$nH_0(s) + O(n) + o(nH_0(s))$ bits such that later, given a substring's endpoints, in
$O(\log n (\log \log n)^2)$ time we can count how many distinct characters it contains.



A closer analysis shows that the time to count the distinct characters in $s[i..j]$
is $O(\log(j - i + 1) \log \log n \log \log(j - i + 2))$. In a future version of this paper
we will improve this bound to $O(\log(j - i + 1) + \min(\log(j - i + 1), \log \log n)^2)$
without increasing our space bound. As far as we know, no other data structure 
for coloured range counting has a non-trivial upper bound depending only on 
the size of the range. 

4 Partial Dynamism 

Suppose $s[i_x]$ and $s[i_y]$ are the last occurrences of $x$ and $y$ strictly before $s[j]$,
and $s[k_x]$ and $s[k_y]$ are their first occurrences strictly after $s[j]$. Then to change
$s[j]$ from an $x$ to a $y$, we need only reset $C[j] \leftarrow i_y$, $C[k_x] \leftarrow i_x$ and $C[k_y] \leftarrow j$. To
delete a character from s, we replace it with a special null character not in the 
alphabet (which we search for and exclude when performing queries). To append 
a character to s, we need only append an entry to C. Assume we have already 
found all the necessary positions using, e.g., a rank/select data structure for s 
(although, given some, we can find the others using our data structure from 
Section 3); in this paper we focus on how to update entries of C in our data 
structure's representation. 
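The three assignments needed to replace a character can be sketched in Python as follows. This is an illustration only: s is held as a 1-indexed list with an unused entry at index 0, the helper names are ours, and linear scans stand in for the rank/select structure that would locate the neighbouring occurrences. A missing earlier occurrence is represented by 0, and a missing later occurrence simply means the corresponding update is skipped.

```python
def build_C(s):
    """s is a 1-indexed list (s[0] unused); returns C with C[0] unused."""
    last, C = {}, [0] * len(s)
    for q in range(1, len(s)):
        C[q] = last.get(s[q], 0)
        last[s[q]] = q
    return C


def replace_char(s, C, j, y):
    """Change s[j] from its current character x to y and patch C in place."""
    x = s[j]
    if x == y:
        return
    n = len(s) - 1

    def last_before(c):
        return next((p for p in range(j - 1, 0, -1) if s[p] == c), 0)

    def first_after(c):
        return next((p for p in range(j + 1, n + 1) if s[p] == c), None)

    i_x, i_y = last_before(x), last_before(y)
    k_x, k_y = first_after(x), first_after(y)
    s[j] = y
    C[j] = i_y              # s[j] now continues the chain of y's
    if k_x is not None:
        C[k_x] = i_x        # the next x skips over position j
    if k_y is not None:
        C[k_y] = j          # the next y now points back to position j
```

Only three entries of C change per replacement, which is why the update cost is dominated by the cost of rewriting those entries in the compressed representation.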

Mäkinen and Navarro [7] gave a dynamic data structure that stores a bitvector
$v$ of length $n$ in $nH_0(v) + o(n)$ bits and supports rank, select, insert and
delete in $O(\log n)$ time. Using this dynamic bitvector data structure, they gave
an efficient dynamic wavelet tree data structure. If we simply replace the two
static wavelet trees in our data structure from Theorem 1 with standard dynamic
wavelet trees, then our space bound does not change and it takes $O(\log^2 n)$
time both to count the number of distinct characters in a given substring and
to update an entry of $C$.

If we simply replace all the static wavelet trees (including the multiary
wavelet tree) in our data structure from Theorem 2 with standard dynamic
wavelet trees, then calculation shows we use
$nH_0(s) + O(n) + o(n(H_0(s) + \log \log \log n))$
bits and $O((\log n \log \log n)^2)$ time both to count the number of distinct
characters in a given substring and to update an entry of $C$. This space bound is
$o(n \log \log \log n)$ bits larger than the space bound in Theorem 2 because $t$
(the string indicating the block size used for each entry of $C$ in Section 3) is
over an alphabet of size $O(\log \log n / \log(1 + \delta))$. Therefore, whereas a multiary
wavelet tree for $t$ takes $nH_0(t) + o(n) = O(n) + o(nH_0(s))$ bits, a standard
wavelet tree for $t$ (static or dynamic) takes $nH_0(t) + o(n \log \log \log n) = O(n) +
o(n(H_0(s) + \log \log \log n))$ bits. If we use a Huffman-shaped dynamic wavelet
tree to store $t$, however, then it takes only $n(H_0(t) + 1) + o(n(H_0(t) + 1)) =
O(n) + o(nH_0(s))$ bits. We will give details in a future version of this paper.

Lemma 1. We can make our data structure from Theorem 2 dynamic, without
changing its space bound, such that it takes $O((\log n \log \log n)^2)$ time both to
count the number of distinct characters in a given substring and to update an
entry of $C$.



Theorem 3. Suppose we have access to a dynamic rank/select data structure
storing $s$ such that queries, insertions and deletions all take $O((\log n \log \log n)^2)$
time. Then we can build another data structure that takes $nH_0(s) + O(n) +
o(nH_0(s))$ bits such that in $O((\log n \log \log n)^2)$ time we can replace, delete or
append a character or, given a substring's endpoints, count how many distinct
characters it contains.